Abstract




The original k-means clustering method works only if the exact vectors representing
the data points are known. Therefore calculating the distances from the centroids
needs vector operations, since the average of abstract data points is undefined.
Existing algorithms can be extended for those cases when the sole input is the
distance matrix, and the exact representing vectors are unknown. This extension may
be named relational k-means after a notation for a similar algorithm invented for
fuzzy clustering. A method is then proposed for generalizing k-means for scenarios
when the data points have absolutely no connection with a Euclidean space.

1 Introduction 


The standard k-means method [1] takes a set of data points $p_1, \dots, p_n \in \mathbb{R}^d$ and a number
of clusters $N$. Its aim is to produce an arrangement of the data points into clusters
(that is, a labeling function $\ell : \{p_i\}_{i=1}^n \to \{1, \dots, N\}$) so that the following objective is
minimized:

$$\sum_{i=1}^{n} \|p_i - z_i\|^2,$$

where $z_i = \frac{1}{|S_i|} \sum_{p_j \in S_i} p_j$ and $S_i = \{p_j : \ell(p_j) = \ell(p_i)\}$.
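
The objective above can be evaluated directly when the vectors are known. The following is a minimal sketch in plain Python; the function name `kmeans_objective` and the sample points are illustrative assumptions, not part of the original method description.

```python
def kmeans_objective(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    # Group point indices by cluster label.
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    total = 0.0
    for members in clusters.values():
        dim = len(points[members[0]])
        # Centroid = coordinate-wise mean of the cluster members.
        centroid = [sum(points[i][k] for i in members) / len(members)
                    for k in range(dim)]
        for i in members:
            total += sum((points[i][k] - centroid[k]) ** 2 for k in range(dim))
    return total


# Hypothetical sample data: two clusters in the plane.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [1, 1, 2, 2]
objective = kmeans_objective(points, labels)
print(objective)
```

Note that the centroid computation is the only step that needs the coordinates themselves, which is exactly the step that becomes unavailable when only distances are given.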

The main difficulty of this method is that it requires the data points to be the elements 
of a Euclidean space, since we need to average the data points somehow. In practice we 
often have data points (e.g., protein sequences) and a distance function which is not 
derived from some Euclidean representation. Even worse, the distance function may not 
be a metric at all. Clustering schemes like k-means are not applicable for these cases, as 
k-means requires vectors as input. 

Various generalizations and extensions of k-means have been developed [1] [2], but none
yet seems to have addressed the above problem. However, the fuzzy c-means clustering
method is reported to have been successfully generalized [3]. The generalized method is
known as Non-Euclidean Relational Fuzzy C-means (NERF c-means). A similar extension
of k-means, which can be viewed as a vast simplification of NERF c-means, is described
in the next sections.



2 Relational k-means 

Suppose first that we have a Euclidean distance matrix between the data points, but 
the exact location of the vectors representing them is unknown to us. Let $A \in \mathbb{R}^{n \times n}$
be the squared distance matrix, namely, $A_{ij} = \|p_i - p_j\|^2$. Our objective is to calculate
the squared norms $\|p_i - z_i\|^2$. The $p_i - z_i$ distance vectors are a special case of those
linear combinations of the $p_j$ points where the sum of the coefficients is zero. That is,
$p_i - z_i = \sum_{j=1}^n \lambda_j p_j$ for some suitable $\lambda \in \mathbb{R}^n$, which satisfies the condition $\sum_{j=1}^n \lambda_j = 0$.

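In fact, it can be easily verified that the squared length of $\sum_{i=1}^n \lambda_i p_i$ can be calculated
by knowing only the matrix $A$:

$$\left\| \sum_{i=1}^n \lambda_i p_i \right\|^2 = \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j \langle p_i, p_j \rangle = -\frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \lambda_i \lambda_j \|p_i - p_j\|^2 = -\frac{1}{2} \lambda^\top A \lambda.$$

In the above transformation we made use of the fact that $\sum_{i=1}^n \lambda_i = 0$: after expanding
$\|p_i - p_j\|^2 = \|p_i\|^2 + \|p_j\|^2 - 2 \langle p_i, p_j \rangle$, the terms containing $\|p_i\|^2$ and $\|p_j\|^2$ vanish.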
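
The identity can be checked numerically. The sketch below draws random points and a random zero-sum coefficient vector, then compares the squared norm computed from the vectors with the value obtained from the squared distance matrix alone. All names (`squared_distance_matrix`, `quadratic_form`, `lam`) and the random data are illustrative assumptions.

```python
import random


def squared_distance_matrix(points):
    """A[i][j] = squared Euclidean distance between points i and j."""
    return [[sum((a - b) ** 2 for a, b in zip(p, q)) for q in points]
            for p in points]


def quadratic_form(A, lam):
    """Evaluate lam^T A lam for a square matrix given as nested lists."""
    n = len(lam)
    return sum(lam[i] * A[i][j] * lam[j] for i in range(n) for j in range(n))


random.seed(0)
n, dim = 6, 3
points = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(n)]
A = squared_distance_matrix(points)

# Random coefficients, shifted so that they sum to zero.
raw = [random.uniform(-1, 1) for _ in range(n)]
mean = sum(raw) / n
lam = [r - mean for r in raw]

# Left-hand side: squared norm of the linear combination, using the vectors.
combo = [sum(lam[i] * points[i][k] for i in range(n)) for k in range(dim)]
lhs = sum(c * c for c in combo)

# Right-hand side: uses only the squared distance matrix.
rhs = -0.5 * quadratic_form(A, lam)

print(lhs, rhs)
```

The two printed values should agree up to floating-point rounding. If the coefficients do not sum to zero, the identity generally fails.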

Calculating a centroid distance is thus possible by computing a quadratic form. This 
means that, even if the only thing we know is the squared distance matrix A, we can run 
practically any k-means heuristic without substantial modifications. Of course, the time 
complexity will be impaired, as computing a quadratic form is an expensive operation. 
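
As an illustration, a Lloyd-style heuristic can be written so that it touches only the squared distance matrix. This is a minimal sketch under stated assumptions, not the author's implementation: the function names, the fixed initial labeling, the iteration cap, and the handling of emptied clusters (they are simply dropped) are all choices made for the example.

```python
def centroid_sq_distance(A, i, members):
    """Squared distance from point i to the centroid of `members`,
    computed as -1/2 * lam^T A lam with zero-sum coefficients lam."""
    n = len(A)
    lam = [0.0] * n
    for j in members:
        lam[j] += 1.0 / len(members)
    lam[i] -= 1.0
    return -0.5 * sum(lam[r] * A[r][c] * lam[c]
                      for r in range(n) for c in range(n))


def relational_kmeans(A, initial_labels, max_iter=100):
    """Lloyd-style reassignment loop using only the squared distance matrix A."""
    labels = list(initial_labels)
    n = len(A)
    for _ in range(max_iter):
        # Collect the members of each current cluster.
        clusters = {}
        for idx, lab in enumerate(labels):
            clusters.setdefault(lab, []).append(idx)
        # Reassign every point to the cluster with the nearest centroid.
        new_labels = [
            min(clusters, key=lambda lab: centroid_sq_distance(A, i, clusters[lab]))
            for i in range(n)
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical data: six points on a line, given only through A.
coords = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
A = [[(a - b) ** 2 for b in coords] for a in coords]
labels = relational_kmeans(A, [0, 1, 0, 1, 0, 1])
print(labels)
```

Each centroid distance here costs a full quadratic form, which reflects the loss in time complexity mentioned above. Expanding the quadratic form for the specific coefficient vector would reduce the cost, but the direct evaluation keeps the sketch close to the formula.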



3 The non-Euclidean case 

Let $e_i$ denote the $i$th standard basis vector, and, for an index set $S \subset \{1, \dots, n\}$, let $x(S) :=
\sum_{i \in S} e_i$. Let $z_S := \frac{1}{|S|} \sum_{j \in S} p_j$ denote a centroid. The formula $d^2(p_i, z_S) := -\frac{1}{2} \lambda^\top A \lambda$
(where $\lambda := \frac{1}{|S|} x(S) - e_i$) still makes sense even if $A$ has not been derived from Euclidean
distances. Therefore, the above formula yields a generalization of the centroid distances.

This means that now we can speak of the weighted arithmetic mean of abstract data 
points in a sense that there is a possible interpretation of distance between two objects 
of that kind. 
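
One way to read this interpretation: if two weighted means are described by weight vectors $u$ and $v$ that each sum to 1, then $u - v$ sums to zero, so the same quadratic form can serve as their squared distance. The sketch below assumes this reading; the function name `generalized_sq_distance` and the sample matrix are illustrative.

```python
def generalized_sq_distance(A, u, v):
    """-1/2 * (u - v)^T A (u - v) for two weight vectors that each sum to 1."""
    n = len(A)
    lam = [u[k] - v[k] for k in range(n)]
    return -0.5 * sum(lam[r] * A[r][c] * lam[c]
                      for r in range(n) for c in range(n))


# Squared distances of three points located at 0, 2 and 6 on a line
# (used here only so that the result can be checked by hand).
A = [[0.0, 4.0, 36.0],
     [4.0, 0.0, 16.0],
     [36.0, 16.0, 0.0]]

u = [0.5, 0.5, 0.0]   # mean of the first two points, located at 1
v = [0.0, 0.0, 1.0]   # the third point itself, located at 6
d2 = generalized_sq_distance(A, u, v)
print(d2)  # (6 - 1)^2 = 25 for this Euclidean example
```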

The above generalization shows that any k-means algorithm can be adapted to abstract 
distances. It is questionable though whether this generalized clustering method yields 
interesting and useful clusters. A completely arbitrary matrix A can produce strange 
results. That is, if $-\frac{1}{2} \lambda^\top A \lambda$ takes a negative value for some vector $\lambda$ (the sum of whose
coordinates is zero), then the distance defined by $\lambda$ will be negative. Of course, this is
not possible in the case of a Euclidean squared distance matrix.
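
A small made-up example shows how this can happen. The matrix below holds the squares of hypothetical dissimilarities 1, 1 and 3, which violate the triangle inequality; the generalized squared distance from the middle object to the centroid of the other two then comes out negative, while a comparable Euclidean matrix gives zero. The function name and both matrices are assumptions chosen for illustration.

```python
def centroid_sq_distance(A, i, members):
    """Generalized squared distance -1/2 * lam^T A lam from object i
    to the centroid of `members`, with lam = x(S)/|S| - e_i."""
    n = len(A)
    lam = [0.0] * n
    for j in members:
        lam[j] += 1.0 / len(members)
    lam[i] -= 1.0
    return -0.5 * sum(lam[r] * A[r][c] * lam[c]
                      for r in range(n) for c in range(n))


# Squares of the dissimilarities 1, 1, 3 (3 > 1 + 1, so not a metric).
A_strange = [[0.0, 1.0, 9.0],
             [1.0, 0.0, 1.0],
             [9.0, 1.0, 0.0]]

# Euclidean counterpart: points at 0, 1, 2 on a line.
A_euclid = [[0.0, 1.0, 4.0],
            [1.0, 0.0, 1.0],
            [4.0, 1.0, 0.0]]

strange_value = centroid_sq_distance(A_strange, 1, [0, 2])
euclid_value = centroid_sq_distance(A_euclid, 1, [0, 2])
print(strange_value, euclid_value)
```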
Negative distances can be eliminated by ensuring that $A_{\mathbf{1}}$ is negative definite, where
$A_{\mathbf{1}}$ is the restriction of the quadratic form $A$ to the linear hyperplane perpendicular to
$\mathbf{1}$. ($\mathbf{1}$ is the vector whose coordinates are all 1.) This may require a modification to the
original squared distance matrix, a modification that should be as small as possible in 
some sense. 

A method proposed in [3] called $\beta$-spread transformation may be applicable here as
well. That is, all the pairwise distances are gradually increased by the same amount until 
we have a matrix of the desired kind. This approach was reported to work well with fuzzy 
c-means for real-world data. The real-world suitability of an analogous matrix correction 
method for generalized k-means is yet to be evaluated. 
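
A correction of this kind might be sketched as follows. The code assumes that the shift adds the same constant $\beta$ to every off-diagonal entry of the squared distance matrix, which is one reading of the description above and has not been checked against [3]. Negative definiteness on the zero-sum hyperplane is tested by attempting a Cholesky factorization in a basis of that hyperplane. The function names, the step size, and the sample matrix are illustrative assumptions.

```python
def is_negative_definite_on_hyperplane(A, tol=1e-12):
    """True if lam^T A lam < 0 for every nonzero zero-sum lam (A symmetric)."""
    m = len(A) - 1
    # Basis of the hyperplane: v_k = e_k - e_m.  B = -V^T A V must be
    # positive definite, which is tested by attempting a Cholesky factorization.
    B = [[-(A[k][l] - A[k][m] - A[m][l] + A[m][m]) for l in range(m)]
         for k in range(m)]
    L = [[0.0] * m for _ in range(m)]
    for r in range(m):
        for c in range(r + 1):
            s = B[r][c] - sum(L[r][t] * L[c][t] for t in range(c))
            if r == c:
                if s <= tol:
                    return False  # non-positive pivot: not positive definite
                L[r][c] = s ** 0.5
            else:
                L[r][c] = s / L[c][c]
    return True


def beta_spread(A, beta):
    """Add beta to every off-diagonal entry (assumed form of the shift)."""
    n = len(A)
    return [[A[i][j] + (beta if i != j else 0.0) for j in range(n)]
            for i in range(n)]


def smallest_beta_on_grid(A, step=0.5, max_steps=1000):
    """Increase beta in fixed steps until the shifted matrix passes the test."""
    beta = 0.0
    for _ in range(max_steps):
        if is_negative_definite_on_hyperplane(beta_spread(A, beta)):
            return beta
        beta += step
    raise ValueError("no suitable beta found within max_steps")


# Hypothetical non-Euclidean input (squares of dissimilarities 1, 1, 3).
A = [[0.0, 1.0, 9.0],
     [1.0, 0.0, 1.0],
     [9.0, 1.0, 0.0]]
beta = smallest_beta_on_grid(A)
print(beta)
```

The fixed-step search mirrors the "gradually increased" wording; an eigenvalue computation could find the threshold directly, at the cost of a numerical linear algebra dependency.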